
Conversation

@blakerouse blakerouse commented Nov 26, 2025

What does this PR do?

At the moment, when a spawned OTEL subprocess fails, it is reported only as exit code 1. That provides no information about what failed and marks the entire Elastic Agent as failed.

This changes that behavior by inspecting the subprocess output to determine the issue and reporting a proper component status for the entire configuration. It does this by parsing the configuration and building its own aggregated status for the component graph that the healthcheckv2 extension would return if the collector could run successfully. It inspects the error message to determine whether it can correlate the error to a specific component in the graph; if it cannot, it falls back to reporting the error on all components.
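
As a rough illustration of that flow (not the PR's actual code; componentEntry and attributeError are hypothetical names used only for this sketch):

package otelstatus

import "strings"

// componentEntry is a stand-in for one node in the aggregated status graph.
type componentEntry struct {
	ID     string // e.g. "filelog/platformlogs"
	Status string // e.g. "StatusOK", "StatusFatalError"
	Error  string
}

// attributeError marks only the components whose IDs appear in the startup
// error text; if no component matches, every component is marked as failed.
func attributeError(components []componentEntry, startupErr string) []componentEntry {
	matched := false
	for i := range components {
		if strings.Contains(startupErr, components[i].ID) {
			components[i].Status = "StatusFatalError"
			components[i].Error = startupErr
			matched = true
		}
	}
	if !matched {
		for i := range components {
			components[i].Status = "StatusFatalError"
			components[i].Error = startupErr
		}
	}
	return components
}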

Why is it important?

The Elastic Agent needs to provide clean status reporting even when the subprocess fails to run, and it must not mark the entire Elastic Agent as errored when that happens.

Checklist

  • I have read and understood the pull request guidelines of this project.
  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • [ ] I have made corresponding changes to the documentation
  • [ ] I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in ./changelog/fragments using the changelog tool
  • [ ] I have added an integration test or an E2E test (covered by unit tests)

Disruptive User Impact

None

How to test this PR locally

Use either an invalid OTEL configuration or one that will fail to start. Observe that when running elastic-agent run with that OTEL configuration in the elastic-agent.yml file (aka Hybrid Mode), elastic-agent status --output=full provides correct state information.

Related issues

@blakerouse blakerouse self-assigned this Nov 26, 2025
@blakerouse blakerouse requested a review from a team as a code owner November 26, 2025 16:41
@blakerouse blakerouse added the Team:Elastic-Agent-Control-Plane (Label for the Agent Control Plane team) and backport-active-all (Automated backport with mergify to all the active branches) labels Nov 26, 2025
@elasticmachine

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

@cmacknz cmacknz requested a review from swiatekm November 26, 2025 18:57
@elasticmachine

💚 Build Succeeded


cc @blakerouse

@swiatekm swiatekm left a comment

The logic looks good to me, but I have some concerns about deadlocks in the otel manager caused by trying to report status from the main loop.

exitCode := 0
err = cmd.RunCollector(ctx, nil, true, "debug", monitoringURL)
if err != nil && !errors.Is(err, context.Canceled) {
	fmt.Fprintln(os.Stderr, err)

FYI, if you use otelcol builder, its generated main.go uses log.Fatalf here. We should do the same if we can.
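
A minimal sketch of that suggestion applied to the snippet above, with runCollector standing in for cmd.RunCollector so the example is self-contained:

package main

import (
	"context"
	"errors"
	"log"
)

// runCollector stands in for cmd.RunCollector from the quoted snippet.
func runCollector(ctx context.Context) error {
	return errors.New("boom")
}

func main() {
	if err := runCollector(context.Background()); err != nil && !errors.Is(err, context.Canceled) {
		// Terminate via the log package, mirroring what the otelcol builder's
		// generated main.go does, instead of hand-rolling an exit code.
		log.Fatalf("collector run failed: %v", err)
	}
}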

// pipelines
for pipelineID, pipelineCfg := range c.Service.Pipelines {
	for _, recvID := range pipelineCfg.Receivers {
		instanceID := componentstatus.NewInstanceID(recvID, component.KindReceiver, pipelineID)

I think it's worth commenting that the upstream graph building code creates a single instance id for each receiver, containing information about all the pipelines it appears in. The status reporting then makes a copy of the status for each pipeline, so it's fine for us to do so here as well, but it's worth calling out imo.
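
To illustrate the equivalence (a sketch assuming the current componentstatus and pipeline APIs in go.opentelemetry.io/collector; not code from this PR):

package main

import (
	"go.opentelemetry.io/collector/component"
	"go.opentelemetry.io/collector/component/componentstatus"
	"go.opentelemetry.io/collector/pipeline"
)

func main() {
	recvID := component.MustNewID("otlp")
	traces := pipeline.NewID(pipeline.SignalTraces)
	metrics := pipeline.NewID(pipeline.SignalMetrics)

	// Upstream graph building: a single instance ID per receiver, carrying
	// every pipeline it appears in; status reporting later copies the status
	// per pipeline.
	upstream := componentstatus.NewInstanceID(recvID, component.KindReceiver, traces, metrics)

	// One instance ID per (receiver, pipeline) pair, as in the quoted loop,
	// which ends up equivalent for status-reporting purposes.
	perPipeline := componentstatus.NewInstanceID(recvID, component.KindReceiver, traces)

	_, _ = upstream, perPipeline
}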

Comment on lines +191 to +192
forIDStr := fmt.Sprintf("for id: %q", instanceID.ComponentID().String())
failedMatchStr := fmt.Sprintf("failed to start %q %s:", instanceID.ComponentID().String(), strings.ToLower(instanceID.Kind().String()))

It'd be good to specify what conditions these lines refer to. The one about starting the component is relatively obvious, but forIDStr, not so much. Is this an error about the otel collector not knowing about the component type?


// reportStartupErr maps this error to the *status.AggregateStatus.
// This is done by parsing `m.mergedCollectorCfg` and converting it into a best-effort *status.AggregateStatus.
func (m *OTelManager) reportStartupErr(ctx context.Context, err error) {

One thing I don't like about this function, is that it pretends to report an error, but actually reports a status. We have different delivery requirements for these. Reporting errors is non-blocking, as a newer error can always overwrite an older error. The same is not true about statuses - we need to make sure all the component statuses are delivered, or else we get bugs like the one in #10675.

As a result, reportStartupErr needs to be called with care, as it can potentially deadlock the manager. I wonder if we should bite the bullet and just make the update channel buffered with some reasonable size, and emit a fatal error if it can't be written to. What do you think?
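
A rough sketch of the buffered-channel option, with a placeholder AggregateStatus type and an assumed buffer size; the real manager's fields and logging will differ:

package otelmanager

import "log"

// AggregateStatus is a placeholder for the healthcheckv2 aggregate status type
// used by the manager; it exists only to keep this sketch self-contained.
type AggregateStatus struct{}

// statusBuffer is an assumed "reasonable size" for the buffered channel.
const statusBuffer = 32

type OTelManager struct {
	statusCh chan *AggregateStatus
}

func NewOTelManager() *OTelManager {
	return &OTelManager{statusCh: make(chan *AggregateStatus, statusBuffer)}
}

// sendStatus does a non-blocking send so the main loop can never deadlock on
// status delivery; if the buffer is full, it surfaces a fatal error instead of
// silently dropping the update.
func (m *OTelManager) sendStatus(st *AggregateStatus) {
	select {
	case m.statusCh <- st:
	default:
		log.Fatalf("otel manager: status channel full, update could not be delivered")
	}
}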

@cmacknz cmacknz left a comment

I ran the collector with the following configuration which should not run:

outputs:
  default:
    type: elasticsearch
    hosts: [127.0.0.1:9200]
    api_key: "example-key"
    #username: "elastic"
    #password: "changeme"
    preset: balanced
    otel:
      exporter:
        not_a_setting: true

As expected, it does not run, and the collector exits with the following output, which looks a lot like a multi-line error; we manage to grab the last line, which at least contains the configuration key name:

{"log.level":"info","@timestamp":"2025-11-27T19:32:59.575Z","message":"failed to get config: cannot unmarshal the configuration: decoding failed due to the following error(s):","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2025-11-27T19:32:59.575Z","message":"'exporters' error reading configuration for \"elasticsearch/_agent-component/monitoring\": decoding failed due to the following error(s):","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2025-11-27T19:32:59.575Z","message":"'' decoding failed due to the following error(s):","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2025-11-27T19:32:59.575Z","message":"'' has invalid keys: not_a_setting","ecs.version":"1.6.0"}
❯ sudo elastic-development-agent status
┌─ fleet
│  └─ status: (STOPPED) Not enrolled into Fleet
└─ elastic-agent
   ├─ status: (DEGRADED) 1 or more components/units in a failed state
   ├─ beat/metrics-monitoring
   │  ├─ status: (FAILED) FAILED
   │  ├─ beat/metrics-monitoring
   │  │  └─ status: (FAILED) '' has invalid keys: not_a_setting
   │  └─ beat/metrics-monitoring-metrics-monitoring-beats
   │     └─ status: (FAILED) '' has invalid keys: not_a_setting
   ├─ filestream-monitoring
   │  ├─ status: (FAILED) FAILED
   │  ├─ filestream-monitoring
   │  │  └─ status: (FAILED) '' has invalid keys: not_a_setting
   │  └─ filestream-monitoring-filestream-monitoring-agent
   │     └─ status: (FAILED) '' has invalid keys: not_a_setting
   ├─ http/metrics-monitoring
   │  ├─ status: (FAILED) FAILED
   │  ├─ http/metrics-monitoring
   │  │  └─ status: (FAILED) '' has invalid keys: not_a_setting
   │  └─ http/metrics-monitoring-metrics-monitoring-agent
   │     └─ status: (FAILED) '' has invalid keys: not_a_setting
   ├─ prometheus/metrics-monitoring
   │  ├─ status: (FAILED) FAILED
   │  ├─ prometheus/metrics-monitoring
   │  │  └─ status: (FAILED) '' has invalid keys: not_a_setting
   │  └─ prometheus/metrics-monitoring-metrics-monitoring-collector
   │     └─ status: (FAILED) '' has invalid keys: not_a_setting
   └─ extensions
      ├─ status: StatusFatalError ['' has invalid keys: not_a_setting]
      └─ extension:healthcheckv2/bcc9882f-6e23-4ec6-b6a7-0abecb4c2ded
         └─ status: StatusFatalError ['' has invalid keys: not_a_setting]

That looks correct enough, except that we leak the healthcheck extension status verbatim (but not beatsauth or the diagnostics extension), which we probably shouldn't do (or at least it is inconsistent).

// The flow of this function comes from https://github.com/open-telemetry/opentelemetry-collector/blob/main/service/internal/graph/graph.go
// It's a much simpler version, but follows the same for loop ordering and building of connectors of the internal
// graph system that OTEL uses to build its component graph.
func otelConfigToStatus(cfg *confmap.Conf, err error) (*status.AggregateStatus, error) {

This is complex enough to have its own tests. The configurations in manager_test.go right now don't cover it all, e.g. there are no configurations with connectors.

To me it seems like tests for this should focus on mapping status for complex configurations and the tests in manager_test.go can focus purely on ensuring various failure scenarios result in capturing a status regardless of what the configuration is.
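
For example, a sketch of such a test with a connector configuration (assuming it lives in the same package as otelConfigToStatus; the package name and assertions are placeholders, kept intentionally loose because the exact layout of the resulting *status.AggregateStatus depends on the implementation):

package otel

import (
	"errors"
	"testing"

	"github.com/stretchr/testify/require"
	"go.opentelemetry.io/collector/confmap"
)

func TestOtelConfigToStatusWithConnector(t *testing.T) {
	// A minimal configuration where the "forward" connector acts as an
	// exporter in one pipeline and a receiver in another.
	cfg := confmap.NewFromStringMap(map[string]any{
		"receivers":  map[string]any{"otlp": map[string]any{}},
		"connectors": map[string]any{"forward": map[string]any{}},
		"exporters":  map[string]any{"debug": map[string]any{}},
		"service": map[string]any{
			"pipelines": map[string]any{
				"traces/in":  map[string]any{"receivers": []any{"otlp"}, "exporters": []any{"forward"}},
				"traces/out": map[string]any{"receivers": []any{"forward"}, "exporters": []any{"debug"}},
			},
		},
	})

	agg, err := otelConfigToStatus(cfg, errors.New(`failed to start "otlp" receiver: boom`))
	require.NoError(t, err)
	require.NotNil(t, agg)
	// Further assertions would check that both pipelines are present in the
	// aggregate and that only the otlp receiver carries the startup error.
}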

